Improving prediction of drug-target interactions based on fusing multiple features with data balancing and feature selection techniques

Drug discovery relies on predicting drug-target interactions (DTIs), which is an important and challenging task. The purpose of DTI prediction is to identify interactions between drug chemical compounds and protein targets. Traditional wet lab experiments are time-consuming and expensive, which is why computational methods based on machine learning have attracted the attention of many researchers in recent years. A dry lab environment that focuses on computational interaction prediction can help limit the search space for wet lab experiments. In this paper, a novel multi-stage approach for DTI prediction, called SRX-DTI, is proposed. In the first stage, a combination of various descriptors from protein sequences and an FP2 fingerprint encoded from the drug are extracted as feature vectors. A major challenge in this application is data imbalance caused by the lack of known interactions; in the second stage, the One-SVM-US technique is proposed to deal with this problem. Next, the FFS-RF algorithm, a forward feature selection algorithm coupled with a random forest (RF) classifier, is developed to maximize predictive performance. This feature selection algorithm removes irrelevant features to obtain optimal features. Finally, the balanced dataset with optimal features is given to the XGBoost classifier to identify DTIs. The experimental results demonstrate that the proposed approach, SRX-DTI, achieves higher performance than other existing methods in predicting DTIs. The datasets and source code are available at: https://github.com/Khojasteh-hb/SRX-DTI.


Introduction
The main phase in the drug discovery process is to identify interactions between drugs and targets (or proteins), which can be performed by in vitro experiments. Identifying drug-target interactions plays a vital role in drug development, which aims to identify new drug compounds for known targets and find new targets for current drugs [1,2]. The expansion of the human genome project has provided better diagnosis of disease, early detection of certain diseases, and identification of drug-target interactions (DTIs) [3]. Although significant efforts have been made in previous years, only a limited number of drug candidates have been approved by the Food and Drug Administration (FDA) to reach the market, whereas most drug candidates have been rejected during clinical verification due to side effects or low efficacy [4]. Moreover, the cost of a new chemistry-based drug is often 2.6 billion dollars, and it typically takes 15 years to finish the drug development and approval procedure. This has become a bottleneck in identifying the targets of candidate drug molecules [2,5]. Experiment-based methods are costly, time-consuming, and limited to a small scale, which motivates researchers to keep developing computational methods for the discovery of new drugs [2,6,7]. These computational methods offer a more efficient and cost-effective approach to drug discovery, allowing researchers to explore a larger range of potential drug candidates and predict their efficacy before investing significant resources into experimental testing. In addition, the availability of online databases in this area, such as KEGG [8,9], DrugBank [10], PubChem [11], Davis [12], TTD [13,14], and STITCH [15], has encouraged Machine Learning (ML) researchers to develop high-throughput computational methods.
Drug discovery involves identifying molecules that can effectively target and modulate the function of disease-related proteins. Besides developing computational methods for predicting drug-target interactions (DTIs), studying protein-protein interactions (PPIs) has also become a top priority for drug discovery, especially due to the SARS-CoV-2 pandemic [16][17][18][19]. Proteins are responsible for various essential processes in vivo via interactions with other molecules. Dysfunctional proteins are often responsible for diseases, making them crucial targets for the drug discovery process [20,21]. Abnormal PPIs can support the development of life-threatening diseases like cancer, further emphasizing the importance of identifying critical proteins and their interactions. Therefore, developing computational methods for identifying critical proteins in PPIs has become an important branch of drug discovery and treatment development [21,22]. In summary, understanding both DTIs and PPIs is critical for successful drug discovery. While this paper focuses on DTI prediction, it is important to consider PPI analysis as well in order to identify potential drug targets and improve the efficacy of drug development efforts.
The prior methods in DTI prediction can be mainly categorized into similarity-based methods and feature-based methods. In similarity-based methods, similar drugs or proteins are considered to find similar interaction patterns. These methods use many different similarity measures based on drug chemical similarity and target sequence similarity to identify drug-target interaction [23][24][25]. Feature-based methods consider drug-target interaction prediction as a binary classification problem and different classification algorithms such as Support Vector Machine (SVM) [26], random forest [27], rotation forest [28,29], XGBoost [30], and deep learning [31][32][33][34][35] have been employed to identify new interactions.
Various machine learning (ML) methods have been applied for drug-target prediction. Mousavian et al. utilized a support vector machine with features extracted from the Position Specific Scoring Matrix (PSSM) of proteins and molecular substructure fingerprint of drugs [26]. Shi et al. presented the LRF-DTIs method based on random forest, using pseudo-position specific scoring matrix (PsePSSM) and FP2 molecular fingerprint to extract features from proteins and drugs, and employing Lasso dimensionality reduction and Synthetic Minority Oversampling Technique (SMOTE) to handle unbalanced data [27]. Wang et al. proposed two methods based on Rotation Forest: RFDT, which used a PSSM descriptor and drug fingerprint as feature vectors [29], and RoFDT, which combined feature-weighted Rotation Forest (FwRF) with protein sequence encoded as PSSM, and drug structure fingerprints [28]. These methods have shown promising results in predicting DTIs. Moreover, Mahmud et al. [30] proposed a computational model, called iDTi-CSsmoteB for the identification of DTIs. They utilized PSSM, amphiphilic pseudo amino acid composition (AM-PseAAC), and dipeptide PseAAC descriptors to present protein and molecular substructure fingerprint (MSF) to present drug molecule structure. Then, the oversampling SMOTE technique was applied to handle the imbalance of datasets, and the XGBoost algorithm as a classifier to predict DTIs.
The increase in the volume and diversity of data has led to the development of various deep learning platforms and libraries, such as DeepPurpose [32] and DeepDrug [35]. DeepPurpose [32] takes the SMILES format of the drug and amino acid sequence of the protein as input and transforms it into a specific format using a specific function. This format is then converted into a vector representation to be used in subsequent steps. This library provides eight encoders using different modalities of compounds, as well as utility functions to load pre-trained models and predict new drugs and targets. Yin et al. [35] proposed another deep learning framework called DeepDrug. Furthermore, variants of graph neural networks such as graph convolutional networks (GCNs) [35], graph attention networks (GATs) [36,37], and gated graph neural networks (GGNNs) [31,33,34] have been developed for DTI prediction.
We introduce SRX-DTI, a novel ML-based method for improving drug-target interaction prediction. First, we generate various descriptors for protein sequences, including Amino Acid Composition (AAC), Dipeptide Composition (DPC), Grouped Amino Acid Composition (GAAC), Dipeptide Deviation from Expected Mean (DDE), Pseudo Amino Acid Composition (PseAAC), Pseudo-Position-Specific Scoring Matrix (PsePSSM), Composition of K-spaced Amino Acid Group Pairs (CKSAAGP), Grouped Dipeptide Composition (GDPC), and Grouped Tripeptide Composition (GTPC). The drug is encoded as an FP2 molecular fingerprint. Second, we use the Under Sampling by One-class Support Vector Machine (One-SVM-US) technique to balance the data, where the positive and negative samples are constructed using drug-target interaction information on the extracted features. Then, we apply the FFS-RF algorithm to select the optimal subset of features. Finally, after comparing various ML classifiers, we choose the XGBoost classifier to predict DTIs using 5-fold cross-validation (CV). We evaluate the performance of our method using several metrics, including AUROC, AUPR, ACC, SEN, SPE, and F1-score. Our method achieves high AUROC values of 0.9920, 0.9880, 0.9788, and 0.9329 for EN, GPCR, IC, and NR, respectively. These results demonstrate that SRX-DTI outperforms existing methods for DTI prediction.
The rest of the paper is organized as follows. The Materials and methods section describes the gold standard datasets, feature extraction, data balancing, and feature selection used in this paper. The Results and discussion section provides the performance evaluation and experimental results. Finally, the Conclusion section summarizes the conclusions.

Materials and methods
In this study, we propose a novel method of drug-target interaction prediction, called SRX-DTI. In the first step, drug chemical structures (SMILES format) and protein sequences (FASTA format) are collected from the DrugBank and KEGG databases using their specific access IDs. In the next step, different feature extraction methods are applied to drug compounds and protein sequences to create a variety of features. Drug-target pair vectors are built from known interactions and extracted features. Afterward, a balancing technique is applied to the DTI vectors to deal with imbalanced datasets, and drug-target features are selected through FFS-RF to boost prediction performance. Finally, the XGBoost classifier is used on the balanced datasets with optimal features to predict DTIs. A schematic diagram of our proposed SRX-DTI model is shown in Fig 1.

Drug-Target datasets
In this research, four gold standard datasets, including enzymes (EN), G-protein-coupled receptors (GPCR), ion channels (IC), and nuclear receptors (NR), released by Yamanishi et al. [38], are used as benchmark datasets to evaluate the performance of the proposed SRX-DTI method in DTI prediction. All these datasets are freely available from http://web.kuicr.kyoto-u.ac.jp/supp/yoshi/drugtarget/. Yamanishi et al. [38] extracted information about drug-target interactions from DrugBank [39], KEGG [8,9], BRENDA [40], and SuperTarget [41]. The numbers of known interactions for enzymes, ion channels, GPCRs, and nuclear receptors are 2926, 1476, 635, and 90, respectively. The SRX-DTI model is also evaluated on the Davis kinase binding affinity dataset [12]. The original Davis dataset contains 30,056 binding affinity interactions between 442 proteins and 68 drug molecules. Here, we filter the dataset by removing all interactions with affinity < 7. After filtering, 2502 interactions between proteins and drug molecules are considered in the Davis dataset. A brief summary of these datasets is given in Table 1.

Feature extraction methods
In order to better identify drug-protein interactions, it is advantageous to extract different features from drugs and targets. This provides more complete information about the known interactions and increases the detection rate. A brief summary of the ten groups of features is given in Table 2. There are two types of features: drug-related features and target-related features, the latter in nine groups A, B, C, D, E, F, G, H, and I. These features are described in turn below. Because data diversity is very important in predictive models, various subsets of these groups have been examined to select appropriate subsets. Based on drug and target descriptors, we constructed four subsets of features (AB, CD, EF, and GHI), which are given in Table 3. Note that the drug features are coupled with each single target group as well as with these subsets. These four subsets have been selected to preserve certain properties of the whole feature groups while keeping diversity among them.

Drug features
For drug compounds, different types of descriptors can be defined based on various types of drug properties, such as FP2, FP3, FP4, and MACCS [42][43][44]. Some studies showed that these descriptors are molecular structure fingerprints that effectively represent the drug [27,45,46]. In this study, the FP2 format fingerprint is used to represent drug compounds. This molecular fingerprint of the drug was extracted through these steps:
Step 1: For each drug, the molecular structure in mol format is downloaded from the KEGG database (https://www.kegg.jp/kegg/drug/) using its drug ID.
Step 3: The drug molecules in mol file format are converted into the FP2 format molecular fingerprint using the OpenBabel software. The FP2 format molecular fingerprint is a hexadecimal digit sequence of length 256, which is converted into a 256-dimensional drug vector of decimal values between 0 and 15.
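The conversion from the hexadecimal fingerprint to a numeric vector can be sketched as follows. This is a minimal illustration that assumes the FP2 fingerprint has already been exported as a plain string of 256 hexadecimal digits; the function name `fp2_hex_to_vector` is hypothetical and is not taken from the SRX-DTI source code.

```python
from typing import List

FP2_HEX_LENGTH = 256  # fingerprint length stated in the text


def fp2_hex_to_vector(hex_fingerprint: str) -> List[int]:
    """Map each hexadecimal digit of an FP2 string to an integer in 0..15."""
    # Drop any whitespace or line breaks left over from the export step.
    digits = "".join(hex_fingerprint.split())
    if len(digits) != FP2_HEX_LENGTH:
        raise ValueError(
            f"expected {FP2_HEX_LENGTH} hex digits, got {len(digits)}"
        )
    # int(ch, 16) raises ValueError for any non-hexadecimal character.
    return [int(ch, 16) for ch in digits]
```

Each drug then contributes one 256-dimensional row to the feature matrix, which can be concatenated with the target descriptors of the paired protein.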

Target features

A) Amino acid composition (AAC):
The amino acid composition [47] is a vector of 20 dimensions, which calculates the frequencies of all 20 natural amino acids (i.e. "ACDEFGHIKLMNPQRSTVWY") as:
$$f(t) = \frac{N(t)}{N}, \quad t \in \{A, C, D, \ldots, Y\}$$
where N(t) is the number of amino acid type t, while N is the length of a protein sequence.
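As an illustration, the AAC descriptor (the count of each amino acid type divided by the sequence length) can be computed in a few lines. This is a sketch under the assumption that the input is a plain one-letter protein sequence; the function name `aac` is ours.

```python
from collections import Counter
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 natural amino acids


def aac(sequence: str) -> List[float]:
    """Return the 20-dimensional amino acid composition of a sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    # N(t) / N for each amino acid type t, in alphabetical order.
    return [counts[aa] / len(seq) for aa in AMINO_ACIDS]
```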

B) Dipeptide composition (DPC):
The Dipeptide Composition [48] gives 400 descriptors for a protein sequence. It is calculated as:
$$D(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in \{A, C, D, \ldots, Y\}$$
where $N_{rs}$ is the number of dipeptides represented by amino acid types r and s, and N denotes the length of the protein.
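A short sketch of the DPC descriptor follows. It assumes the usual normalization, in which the count of each dipeptide is divided by N - 1, the number of overlapping dipeptides in a sequence of length N; the function name `dpc` is ours.

```python
from itertools import product
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=2)]  # 400 keys


def dpc(sequence: str) -> List[float]:
    """Return the 400-dimensional dipeptide composition of a sequence."""
    seq = sequence.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
    n_pairs = len(seq) - 1
    return [counts[dp] / n_pairs for dp in DIPEPTIDES]
```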

C) Grouped Amino Acid Composition (GAAC):
In the GAAC encoding [49], the 20 amino acid types are divided into five classes according to their physicochemical properties. The GAAC descriptor is the frequency of each amino acid group, which is calculated as:
$$f(g) = \frac{N(g)}{N}, \quad g \in \{g_1, g_2, g_3, g_4, g_5\}$$
$$N(g) = \sum_{t \in g} N(t)$$
where N(g) is the number of amino acids in group g, N(t) is the number of amino acid type t, and N is the length of the protein sequence.
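The grouped composition can be sketched as below. The five groups used here (aliphatic, aromatic, positively charged, negatively charged, uncharged) follow the grouping used by the iFeature package; the text does not list the group members, so this grouping is an assumption.

```python
from typing import List

# Assumed grouping (as in iFeature); the paper does not spell out the members.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}


def gaac(sequence: str) -> List[float]:
    """Return the 5-dimensional grouped amino acid composition."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    # N(g) is the sum of N(t) over all amino acid types t in group g.
    return [
        sum(seq.count(aa) for aa in members) / len(seq)
        for members in GROUPS.values()
    ]
```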

D) Dipeptide Deviation from Expected mean (DDE):
The Dipeptide Deviation from Expected mean [48] is a feature vector constructed by computing three parameters, i.e. dipeptide composition ($D_c$), theoretical mean ($T_m$), and theoretical variance ($T_v$). These three parameters and the DDE are defined as follows. $D_c(r, s)$, the dipeptide composition measure for the dipeptide 'rs', is given as:
$$D_c(r, s) = \frac{N_{rs}}{N - 1}$$
where $N_{rs}$ is the number of dipeptides represented by amino acid types r and s, and N is the length of the protein. $T_m(r, s)$, the theoretical mean, is given by:
$$T_m(r, s) = \frac{C_r}{C_N} \times \frac{C_s}{C_N}$$
where $C_r$ is the number of codons coding for the first amino acid, $C_s$ is the number of codons coding for the second amino acid in the given dipeptide 'rs', and $C_N$ is the total number of possible codons. $T_v(r, s)$, the theoretical variance of the dipeptide 'rs', is given by:
$$T_v(r, s) = \frac{T_m(r, s)\,(1 - T_m(r, s))}{N - 1}$$
Finally, DDE(r, s) is calculated as:
$$DDE(r, s) = \frac{D_c(r, s) - T_m(r, s)}{\sqrt{T_v(r, s)}}$$
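The three DDE parameters can be combined in code as follows. This sketch assumes the standard genetic code, in which 61 codons code for amino acids (so $C_N$ = 61, excluding stop codons), and the common formulation in which the deviation of $D_c$ from $T_m$ is divided by the square root of $T_v$. The codon table and the function name `dde` are our additions.

```python
import math
from itertools import product
from typing import List

# Number of codons per amino acid in the standard genetic code (sums to 61).
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = sum(CODONS.values())  # 61 sense codons
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def dde(sequence: str) -> List[float]:
    """Return the 400-dimensional DDE vector of a protein sequence."""
    seq = sequence.upper()
    n = len(seq)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {"".join(p): 0 for p in product(AMINO_ACIDS, repeat=2)}
    for i in range(n - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
    features = []
    for pair, count in counts.items():
        d_c = count / (n - 1)                                   # composition
        t_m = (CODONS[pair[0]] / C_N) * (CODONS[pair[1]] / C_N)  # mean
        t_v = t_m * (1 - t_m) / (n - 1)                         # variance
        features.append((d_c - t_m) / math.sqrt(t_v))
    return features
```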

E) Pseudo Amino Acid composition (PseAAC):
To avoid completely losing the sequence-order information, the concept of PseAAC (pseudo amino acid composition) was proposed by Chou [50]. The idea of PseAAC has been widely used in bioinformatics, including proteomics [51] and system biology [52], for tasks such as predicting protein structural class [53], predicting protein subcellular localization [54], predicting DNA-binding proteins [55], and many other applications. In contrast with AAC, which includes 20 components each reflecting the occurrence frequency of one of the 20 native amino acids in a protein, the PseAAC contains a set of more than 20 discrete factors, where the first 20 represent the components of the conventional amino acid composition while the additional factors are a series of rank-different correlation factors along the protein chain. According to the concept of PseAAC [50], any protein sequence can be formulated as a PseAAC vector given by:
$$P = [p_1, p_2, \ldots, p_{20}, p_{20+1}, \ldots, p_{20+\lambda}]^T, \quad (\lambda < L)$$
where L is the length of the protein sequence, and λ is the sequence-related factor; choosing a different integer for λ leads to a PseAAC of a different dimension. Each of the components can be defined as follows:
$$p_u = \begin{cases} \dfrac{f_u}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 1 \le u \le 20 \\[2ex] \dfrac{w\,\tau_{u-20}}{\sum_{i=1}^{20} f_i + w \sum_{k=1}^{\lambda} \tau_k}, & 20 + 1 \le u \le 20 + \lambda \end{cases}$$
where w is the weight factor, and $f_i$ indicates the frequency of the i-th amino acid in the protein sequence. $\tau_k$, the k-th tier correlation factor, reflects the sequence-order correlation between all the k-th most contiguous residues, as formulated by:
$$\tau_k = \frac{1}{L - k} \sum_{i=1}^{L-k} J_{i,i+k}, \quad (k < L)$$
with
$$J_{i,i+k} = \frac{1}{\Gamma} \sum_{q=1}^{\Gamma} \left[ F_q(R_{i+k}) - F_q(R_i) \right]^2$$
where $F_q(R_i)$ is the q-th function of amino acid $R_i$, and Γ is the total number of functions considered. In this research, the functions considered are the hydrophobicity value, the hydrophilicity value, and the side chain mass of the amino acid. Therefore, the total number of functions Γ is 3. In this study, λ is set to 1 and w is set to 0.05. The output characteristic dimensions of each target protein are 28 for the PseAAC descriptor.
F) Pseudo position specific scoring matrix (PsePSSM):
To represent characteristics of the amino acid (AA) sequence of proteins, the pseudo-position specific scoring matrix (PsePSSM) features introduced by Shen et al. [56] are used. The PsePSSM features encode the evolutionary information of the protein sequence and have been broadly used in bioinformatics research [16,56,57].
For each target sequence P with L amino acid residues, the PSSM proposed by Jones et al. [58] is used as its descriptor. The position-specific scoring matrix (PSSM) with a dimension of L×20 can be defined as:
$$P_{PSSM} = \begin{bmatrix} M_{1,1} & M_{1,2} & \cdots & M_{1,20} \\ M_{2,1} & M_{2,2} & \cdots & M_{2,20} \\ \vdots & \vdots & \ddots & \vdots \\ M_{L,1} & M_{L,2} & \cdots & M_{L,20} \end{bmatrix}$$
where $M_{i,j}$ indicates the score of the amino acid residue in the i-th position of the protein sequence being mutated to amino acid type j during the evolution process. Here, to simplify the formulation, the numerical codes 1, 2, ..., 20 are used to represent the 20 native amino acid types according to the alphabetical order of their single character codes. The PSSM can be obtained by searching the Swiss-Prot database with PSI-BLAST [59]. A positive score shows that the corresponding residue is mutated more frequently than expected, and a negative score shows the contrary.
In this work, the PSI-BLAST parameters are set as follows: the E-value threshold equals 0.001, the maximum number of iterations for multiple searches equals 3, and the remaining parameters are left at their defaults. Each element in the original PSSM matrix was normalized to the interval (0, 1) using Eq (14). However, because target sequences differ in length, a uniform representation of the PSSM descriptor is helpful. One possible representation of the protein sample P is:
$$\bar{P}_{PSSM} = [\bar{M}_1, \bar{M}_2, \ldots, \bar{M}_{20}]^T$$
where T is the transpose operator, and
$$\bar{M}_j = \frac{1}{L} \sum_{i=1}^{L} M_{i,j}, \quad j = 1, 2, \ldots, 20$$
where $M_{i,j}$ is the score of the amino acid residue in the i-th position of protein P being changed to amino acid type j after normalization, and $\bar{M}_j$ represents the average score of the amino acid residues in protein P being mutated to amino acid type j during the process of evolution. However, if $\bar{P}_{PSSM}$ of Eq (13) represents the protein P, all the sequence-order information would be lost. To avoid complete loss of the sequence-order information, we follow the concept of the pseudo amino acid composition introduced by Chou [60]; i.e., instead of Eq (11), we use the pseudo-position-specific scoring matrix (PsePSSM) to represent the protein P:
$$P_{PsePSSM} = [\bar{M}_1, \ldots, \bar{M}_{20}, G_1^1, \ldots, G_{20}^1, \ldots, G_1^{\lambda}, \ldots, G_{20}^{\lambda}]^T$$
where
$$G_j^{\lambda} = \frac{1}{L - \lambda} \sum_{i=1}^{L - \lambda} \left[ M_{i,j} - M_{i+\lambda,j} \right]^2, \quad j = 1, 2, \ldots, 20, \; \lambda < L$$
where $G_j^{\lambda}$ represents the correlation factor of the j-th amino acid type and λ is the continuous distance along the protein sequence. This means that $G_j^1$ is the correlation factor coupled along the most contiguous PSSM scores on the protein chain for amino acid type j, $G_j^2$ is that of the second-most contiguous PSSM scores, and so on. Therefore, a protein sequence can be defined as Eq (15) using PsePSSM, which produces a 20 + 20 × λ-dimensional feature vector. In this study, λ is set to 10. The output characteristic dimension of each target protein is 220 for the PsePSSM descriptor.
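A compact sketch of the PsePSSM construction is given below. It assumes the PSSM has already been produced by PSI-BLAST and normalized to (0, 1), and that each correlation factor is the mean squared difference between PSSM rows that are a fixed distance apart, which is the usual PsePSSM formulation. The function name `pse_pssm` and the NumPy-based layout are our choices.

```python
import numpy as np


def pse_pssm(pssm: np.ndarray, lam: int = 10) -> np.ndarray:
    """Build the (20 + 20 * lam)-dimensional PsePSSM vector.

    pssm: normalized PSSM of shape (L, 20), one row per residue.
    lam:  maximum distance along the sequence (10 in the text).
    """
    pssm = np.asarray(pssm, dtype=float)
    if pssm.ndim != 2 or pssm.shape[1] != 20:
        raise ValueError("pssm must have shape (L, 20)")
    length = pssm.shape[0]
    if lam >= length:
        raise ValueError("lam must be smaller than the sequence length")
    # First 20 components: column averages over all positions.
    parts = [pssm.mean(axis=0)]
    # Next 20 * lam components: correlation factors for distances 1..lam.
    for k in range(1, lam + 1):
        diff = pssm[:-k] - pssm[k:]            # rows i and i + k
        parts.append((diff ** 2).mean(axis=0))  # average over L - k pairs
    return np.concatenate(parts)
```

With λ = 10 this yields the 220-dimensional vector mentioned in the text, independent of the length of the protein.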

G) Composition of k-spaced amino acid group pairs (CKSAAGP):
The Composition of k-Spaced Amino Acid Group Pairs (CKSAAGP) [61] defines the frequency of amino acid group pairs separated by any k residues (the default maximum value of k is set as 5). If k = 0, the 0-spaced group pairs are represented as:
$$\left( \frac{N_{g_1 g_1}}{N_{total}}, \frac{N_{g_1 g_2}}{N_{total}}, \frac{N_{g_1 g_3}}{N_{total}}, \ldots, \frac{N_{g_5 g_5}}{N_{total}} \right)_{25}$$
where the value of each descriptor indicates the composition of the corresponding residue group pair in a protein sequence. For a protein of length P and k = 0, 1, 2, 3, 4 and 5, the values of $N_{total}$ are P-1, P-2, P-3, P-4, P-5 and P-6, respectively.

H) Grouped dipeptide composition (GDPC):
The Grouped Di-Peptide Composition encoding [61] is a vector of 25 dimensions, which is another variation of the DPC descriptor. It is defined as:
$$f(r, s) = \frac{N_{rs}}{N - 1}, \quad r, s \in \{g_1, g_2, g_3, g_4, g_5\}$$
where $N_{rs}$ is the number of dipeptides represented by amino acid groups r and s, and N denotes the length of a protein.

I) Grouped tripeptide composition (GTPC):
The Grouped Tri-Peptide Composition encoding [61] is also a variation of the TPC descriptor, which generates a vector of 125 dimensions, defined as:
$$f(r, s, t) = \frac{N_{rst}}{N - 2}, \quad r, s, t \in \{g_1, g_2, g_3, g_4, g_5\}$$
where $N_{rst}$ is the number of tripeptides represented by amino acid groups r, s and t, and N denotes the length of a protein.
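Because GDPC and GTPC differ only in the length of the peptide window, both can be sketched with one helper. The five residue groups below follow the iFeature grouping and are an assumption, as is the normalization by the number of overlapping windows (N - 1 for dipeptides, N - 2 for tripeptides); the function name is ours.

```python
from itertools import product
from typing import List

# Assumed grouping (as in iFeature); not listed explicitly in the text.
GROUPS = {
    "aliphatic": "GAVLMI",
    "aromatic": "FYW",
    "positive": "KRH",
    "negative": "DE",
    "uncharged": "STCPNQ",
}
GROUP_NAMES = list(GROUPS)
RESIDUE_TO_GROUP = {aa: name for name, members in GROUPS.items() for aa in members}


def grouped_kmer_composition(sequence: str, k: int) -> List[float]:
    """Return 5**k grouped k-peptide frequencies (k=2: GDPC, k=3: GTPC)."""
    seq = sequence.upper()
    n_windows = len(seq) - k + 1
    if k < 1 or n_windows < 1:
        raise ValueError("sequence is shorter than the window size k")
    keys = list(product(GROUP_NAMES, repeat=k))
    counts = dict.fromkeys(keys, 0)
    for i in range(n_windows):
        window = seq[i:i + k]
        # Windows containing non-standard residues are not counted.
        if all(aa in RESIDUE_TO_GROUP for aa in window):
            counts[tuple(RESIDUE_TO_GROUP[aa] for aa in window)] += 1
    return [counts[key] / n_windows for key in keys]
```

Calling the helper with `k=2` gives the 25-dimensional GDPC vector and with `k=3` the 125-dimensional GTPC vector.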

Data balancing technique
The experiment datasets that we used in this study were highly imbalanced. Imbalanced datasets can present a challenge for many machine learning algorithms, as they may prioritize the majority class and ignore the minority class, leading to poor performance on the minority class. Different techniques have been utilized to balance the imbalanced dataset, such as random undersampling [26,62,63], cluster undersampling [64,65], and SMOTE technique [27,30]. To address the issue of imbalanced data in our study, we developed a new undersampling algorithm called One-SVM-US, which uses One-class Support Vector Machine (SVM) to deal with imbalanced data. The steps of the One-SVM-US algorithm were implemented as Algorithm 1. In the first step, the known DTIs are considered positive samples. For enzymes, ion channels, GPCRs, nuclear receptors, and the Davis dataset, the number of positives is 2926, 1476, 635, 90, and 2502, respectively. In the next step, the algorithm considers all of the possible interactions in five datasets as negative samples except the ones that have been known as positive. By performing the One-SVM-US algorithm, it would result in a balanced dataset with equal numbers of positive and negative samples.
A One-Class Support Vector Machine (One-class SVM) [66] is a semi-supervised global anomaly detector that needs a training set containing only one class. The One-SVM-US technique, based on One-class SVM, considers all possible combinations of drug and target except those that are positive samples. This algorithm uses a hypersphere to encompass all of the instances instead of using a hyperplane to separate two classes of samples. We apply the RBF kernel for the SVM. The parameter γ was set by a simple heuristic, γ = 1/(number of data points). To compute the outlier score, first, the maximum value of the decision function is obtained over x, where x refers to the vector of scores. Then, the outlier score of each sample is obtained. The outlier scores are sorted in ascending order and n samples, matching the size of the minority class, are selected from the sorted list. The final data is constructed from the combination of the minority class from the original experimental dataset and the majority class samples chosen by the proposed method. We would like to mention that Algorithm 1 performs effectively in producing balanced datasets.
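The undersampling idea can be sketched with scikit-learn's `OneClassSVM`. This is an illustration and not the authors' Algorithm 1: in particular, defining the outlier score as the maximum decision value minus each sample's decision value is our assumption, as is the function name `one_svm_undersample`.

```python
import numpy as np
from sklearn.svm import OneClassSVM


def one_svm_undersample(X_negative: np.ndarray, n_keep: int) -> np.ndarray:
    """Return indices of n_keep negative samples chosen by a one-class SVM.

    X_negative: feature matrix of the majority (negative) class only.
    n_keep:     number of samples to keep, i.e. the minority class size.
    """
    n_samples = X_negative.shape[0]
    if not 0 < n_keep <= n_samples:
        raise ValueError("n_keep must be between 1 and the number of samples")
    # Heuristic from the text: gamma = 1 / number of data points.
    model = OneClassSVM(kernel="rbf", gamma=1.0 / n_samples)
    model.fit(X_negative)
    scores = model.decision_function(X_negative)
    # Assumed outlier score: distance of each score from the maximum score.
    outlier_score = scores.max() - scores
    # Ascending sort, then keep the n_keep samples with the lowest scores.
    order = np.argsort(outlier_score, kind="stable")
    return np.sort(order[:n_keep])
```

The selected rows of `X_negative` would then be stacked with all positive samples to form the balanced dataset.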

Feature selection technique
Reducing the number of input features can reduce the computational cost of modeling and, in some cases, improve the performance of the model. We therefore develop a feature selection algorithm with RF, called FFS-RF. This algorithm is based on the forward feature selection (FFS) technique [67], coupled with RF, to obtain optimal features for DTI prediction. The RF approach [68] is an ensemble method that combines a large number of individual binary decision trees. The performance of the RF model in feature selection was evaluated by a 5-fold CV to construct an effective prediction framework. Forward feature selection is an iterative process, which begins with an empty set of features. In each iteration, it adds one feature and evaluates whether the addition improves the performance. The FFS-RF technique continues until the addition of a new feature does not improve the performance of the model, as outlined step by step in Algorithm 2.
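The greedy loop described above can be sketched as follows. This is a minimal illustration rather than the authors' Algorithm 2: the use of cross-validated accuracy as the selection criterion, the default hyperparameters, and the function name `forward_select_rf` are assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score


def forward_select_rf(X, y, cv=5, n_estimators=100, random_state=0):
    """Greedy forward feature selection scored by a random forest.

    Returns the list of selected column indices and the best CV score.
    """
    remaining = list(range(X.shape[1]))
    selected = []
    best_score = -np.inf
    while remaining:
        # Score every candidate feature when added to the current subset.
        trials = []
        for feature in remaining:
            clf = RandomForestClassifier(
                n_estimators=n_estimators, random_state=random_state
            )
            columns = selected + [feature]
            score = cross_val_score(
                clf, X[:, columns], y, cv=cv, scoring="accuracy"
            ).mean()
            trials.append((score, feature))
        score, feature = max(trials, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the model: stop
        selected.append(feature)
        remaining.remove(feature)
        best_score = score
    return selected, best_score
```

Each iteration costs one cross-validated fit per remaining feature, so the procedure is practical mainly because it tends to stop after a small number of features.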

Results and discussion
In this section, we explain the experimental results of our proposed method in DTI prediction. We implemented all the phases of the proposed model, i.e., feature extraction, data balancing, and classifiers, in Python (version 3.10) using the Scikit-learn library. Some of the target descriptors were calculated with the iFeature package [61] and the rest were implemented in Python. The OpenBabel software was used to extract fingerprint descriptors from drugs. All of the implementations were run on a computer with a 2.50 GHz Intel Xeon Gold 5-2670 CPU and 64 GB RAM.

Performance evaluation
Most of the methods in DTI prediction [5,6,26,30] have utilized 5-fold cross validation (CV) to assess the power of the model to generalize. We also use the 5-fold CV to estimate the skill of the SRX-DTI model on new data and make a fair comparison with the other state-of-the-art methods. The drug-target datasets were split into 5 subsets where each subset was used as a testing set. In the first iteration, the first subset is used to test the model and the rest are used to train the model. In the second iteration, 2nd subset is used as the testing set while the rest serves as the training set. This process is repeated until each fold of the 5 folds is used as the testing set. Then, the performance is reported as the average of the five validation results for drug-target datasets.
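The 5-fold procedure can be sketched with scikit-learn as below. The helper accepts any scikit-learn-compatible classifier (for example `XGBClassifier` from the `xgboost` package); shuffling the samples before splitting and reporting accuracy are assumptions of this sketch, and the function name is ours.

```python
import numpy as np
from sklearn.base import clone
from sklearn.model_selection import KFold


def k_fold_accuracy(estimator, X, y, n_splits=5, seed=0):
    """Train and test an estimator on each fold; return mean and per-fold accuracy."""
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    scores = []
    for train_idx, test_idx in folds.split(X):
        # A fresh copy of the model is fitted on the other folds.
        model = clone(estimator).fit(X[train_idx], y[train_idx])
        # The held-out fold is used only for testing.
        scores.append(model.score(X[test_idx], y[test_idx]))
    return float(np.mean(scores)), scores
```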
In this study, we perform three types of analyses. First, the importance of feature extraction is discussed. Secondly, we investigate the impact of our balancing technique (One-SVM-US) versus the random undersampling technique on CV results. Finally, the effectiveness of the feature selection method is analyzed.
We used the following evaluation metrics to assess the performance of the proposed model: accuracy (ACC), sensitivity (SEN), specificity (SPE), and F1 Score.
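For reference, these four metrics can be computed from the confusion matrix counts (TP, TN, FP, FN) using their standard definitions. The sketch below assumes binary labels with 1 for an interacting pair and 0 otherwise; returning 0.0 when a denominator is zero is a convention chosen here.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Compute ACC, SEN, SPE and F1 for binary labels (1 = interaction)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    def ratio(num: float, den: float) -> float:
        return num / den if den else 0.0

    sen = ratio(tp, tp + fn)          # sensitivity (recall)
    spe = ratio(tn, tn + fp)          # specificity
    precision = ratio(tp, tp + fp)
    return {
        "ACC": ratio(tp + tn, len(pairs)),
        "SEN": sen,
        "SPE": spe,
        "F1": ratio(2 * precision * sen, precision + sen),
    }
```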

The effectiveness of feature groups
We constructed nine different feature groups namely A, B, C, D, E, F, G, H, and I, which all were coupled with drug features to assess the effects of the different sets of features on the performance of the different classifiers including SVM, RF, MLP, and XGBoost. The feature groups have already been reported in Table 2. We also created some subsets from the groups (AB, CD, EF, and GHI), which are given in Table 3. The selection of the best combination can be considered an optimization problem. Here, we combine feature descriptors based on nonmonotonic information and the performance results we get for different classifiers in single feature groups.
We performed experiments to test the effectiveness of the feature groups. In the experiments, we changed the feature groups and applied the random undersampling technique to balance datasets. Statistics of the prediction performance for different classifier models are given in Tables 4 and 5.
Focusing on the EN dataset, we compared the DTI prediction performance of four different classifiers on nine feature groups and four subsets of them. We also highlighted several characteristics that could be considered when selecting the best classifier for DTI prediction. The results indicated that XGBoost is competitive in predicting interactions. We also made some subsets from single groups, namely AB, CD, EF, and GHI. Two classifiers, MLP and XGBoost, had close performance and outperformed the other ML methods in predicting DTIs.

The influence of the data balancing techniques
Imbalanced data classification is a significant challenge for predictive modeling. Most of the machine learning algorithms used for classification were designed around the assumption of an equal number of samples for each class. Imbalanced data lead to biased prediction results. Tables 6-8 reveal the efficiency of the One-SVM-US algorithm. The pattern observed in Table 6 is similar in group EF, which is shown in Table 7. In the case of EN, the prediction results of ACC, SEN, SPE, and F1 on data balanced with One-SVM-US are 0.9901, 0.9947, 0.9967, and 0.9956, which are 0.1895, 0.1456, 0.208, and 0.176 higher than those on data balanced with random undersampling, respectively. These prediction results show that the One-SVM-US technique obtains a comparatively advantageous performance. In the case of the GPCR, IC, and NR datasets, the ACC, SEN, SPE, and F1 results for data balanced with One-SVM-US and with random undersampling are in Table 6. The values of these metrics are also shown in Table 7 for group EF. To better analyze the proposed methods, the ROC curves of the two data balancing techniques are shown in Fig 2a-2d. These curves demonstrate discriminative ability: in group AB, the ROC curve obtained with One-SVM-US covers a larger area than that obtained with random undersampling. The ROC curves of group EF are shown in Fig 3a-3d, where One-SVM-US again covers a larger area than random undersampling.
Table 8 shows the model performance on the Davis dataset when balanced with random undersampling and with One-SVM-US. It can be observed that the proposed One-SVM-US exhibits a similar performance in all datasets. For the Davis dataset, the model AUROC values are 0.9786, 0.9839, 0.9756, and 0.9696 in groups AB, CD, EF, and GHI, respectively. For each feature group, the One-SVM-US technique also performs better in terms of AUPR, with values of 0.9848, 0.9896, 0.9835, and 0.9781 for groups AB, CD, EF, and GHI, respectively. These results demonstrate that the dataset balanced using One-SVM-US significantly outperforms the dataset balanced using random undersampling in terms of ROC curves. The accuracy of the XGBoost classifier improved after applying One-SVM-US. For all five datasets, the SEN, SPE, and F1 results are significantly better with One-SVM-US. Ultimately, One-SVM-US is an efficient method for making balanced datasets, reducing bias and boosting the model's performance.

The effectiveness of feature selection technique
Feature selection is extremely important in ML because it directs a given ML algorithm toward informative features. Feature selection techniques are especially indispensable in scenarios with many features, a situation known as the curse of dimensionality. The solution is to decrease the dimensionality of the feature space via a feature selection method. By selecting an optimal subset of features, a feature selection technique reduces the computational cost. Various feature selection techniques have been utilized in DTI prediction [1,6,64]. Wrapper-based methods are a category of supervised feature selection methods that use a model to score different subsets of features and finally select the best one. Forward selection is one of the wrapper-based methods; it starts from a null model with zero features and adds them greedily one at a time to maximize the model performance. Here, we use the FFS-RF algorithm to find the optimal subset and maximize performance. Table 9 reports the performance of FFS-RF on the EN dataset in groups AB, CD, EF, and GHI, showing the ACC, AUROC, and AUPR metrics obtained with the reduced set of input features. The value of FFS-RF is clearly observable: for the EN dataset, we use just 8 features instead of 676 features in group AB, 10 features instead of

Selection of predictor model
In this study, we focus on four classifiers: SVM, Random Forest (RF), MLP, and XGBoost. To evaluate these classifier models, we apply the Cross Validation (CV) technique to select an appropriate predictor model for our problem. The results of the different predictive models are shown for the EN dataset in group AB in Table 12. For a clearer comparison of prediction performance, the results for the EN dataset are also presented as a bar graph in Fig 4. For a more thorough evaluation, we compare the DTI prediction performance of the classifier models on the benchmark Yamanishi and Davis datasets. For each classifier, we use the balanced datasets with optimal features to predict DTIs. Table 13 provides a comparison of the XGBoost for SRX-DTI, as the best performing method, and RF, as the second-best performing  [28], and Mahmud et al. [6]. The AUROCs generated by these models are listed in Table 15.
As shown in Table 15, the AUROC of the proposed model is higher than that of the other methods on all the datasets.
Average AUROC values of SRX-DTI on EN, GPCR, IC, and NR are 0.9920, 0.9880, 0.9788, and 0.9329, respectively. Note that most of the existing models do not include a feature selection phase [26,28,29,70,71]. Training a model with more features can lead to overfitting and reduce its generalization power. In contrast, we achieve an AUROC of 0.9920 in group AB by using just eight features instead of all 676 features. This is highly valuable in terms of computational cost. Moreover, our balancing method effectively addresses the imbalance problem in the datasets, and the feature selection technique selects an optimal subset of features for all five datasets. Ultimately, the XGBoost classifier is scalable and performs better than the other classifiers in identifying new DTIs.
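The cross-validation procedure used in this section to choose a predictor model can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the helper names (`kfold_indices`, `cv_accuracy`, `select_model`), the fold count `k=5`, the unstratified shuffled folds, and the use of accuracy as the only score are all assumptions. The two toy classifiers are hypothetical stand-ins for SVM, RF, MLP, and XGBoost; any object exposing `fit(X, y)` (returning itself) and `predict(X)` could be substituted.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple


class MajorityClass:
    """Toy baseline: always predicts the most frequent training label."""

    def fit(self, X, y):
        labels = list(y)
        self.label_ = max(set(labels), key=labels.count)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


class NearestCentroid:
    """Toy classifier: assigns each sample to the closest class centroid."""

    def fit(self, X, y):
        self.centroids_ = {}
        for label in set(y):
            rows = [x for x, t in zip(X, y) if t == label]
            self.centroids_[label] = [sum(c) / len(c) for c in zip(*rows)]
        return self

    def predict(self, X):
        def dist2(a, b):
            return sum((p - q) ** 2 for p, q in zip(a, b))

        return [
            min(self.centroids_, key=lambda c: dist2(x, self.centroids_[c]))
            for x in X
        ]


def kfold_indices(n: int, k: int, seed: int = 0) -> List[List[int]]:
    """Shuffle 0..n-1 with a fixed seed and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cv_accuracy(make_model: Callable[[], object], X, y, k: int = 5) -> float:
    """Mean held-out accuracy over k folds for a freshly built model."""
    scores = []
    for test_idx in kfold_indices(len(X), k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(X)) if i not in held_out]
        model = make_model().fit(
            [X[i] for i in train_idx], [y[i] for i in train_idx]
        )
        pred = model.predict([X[i] for i in test_idx])
        hits = sum(p == y[i] for p, i in zip(pred, test_idx))
        scores.append(hits / len(test_idx))
    return sum(scores) / len(scores)


def select_model(
    candidates: Dict[str, Callable[[], object]], X, y, k: int = 5
) -> Tuple[str, Dict[str, float]]:
    """Score every candidate with the same folds and return the best name."""
    scores = {name: cv_accuracy(make, X, y, k) for name, make in candidates.items()}
    return max(scores, key=scores.get), scores


if __name__ == "__main__":
    # Hypothetical toy data: two well-separated clusters, ten samples each.
    X = [[0.1 * i, 0.0] for i in range(10)] + [[5 + 0.1 * i, 1.0] for i in range(10)]
    y = [0] * 10 + [1] * 10
    best, scores = select_model(
        {"majority": MajorityClass, "nearest_centroid": NearestCentroid}, X, y
    )
    print(best, scores)
```

All candidates are scored on identical folds, so differences in the mean score reflect the models rather than the data split. For imbalanced data, stratified folds and ranking metrics such as AUROC or AUPR would be more appropriate than plain accuracy.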

Conclusion
The identification of drug-target interactions through experimentation is a costly and time-consuming process. Therefore, the development of computational methods for identifying interactions between drugs and target proteins has become a critical step in reducing the search space for laboratory experiments. In this work, we proposed a novel framework for predicting drug-target interactions. Our approach is unique in that we use a variety of descriptors for target proteins. We implement the One-SVM-US technique to address unbalanced data. The most important advantage of the proposed method is the FFS-RF algorithm, developed to find an optimal subset of features that reduces computational cost and improves prediction performance. We also compare the performance of four classifiers on balanced datasets with optimal features, ultimately selecting the XGBoost classifier to predict DTIs in our model. We then employ the XGBoost classifier to predict DTIs on five benchmark datasets. Our SRX-DTI model achieved good prediction results, which showed that the proposed method outperforms other methods in predicting DTIs. The main limitation of this work is the need for feature engineering, in contrast to deep learning methods. However, the feature selection technique can also be considered a knowledge discovery tool that provides an understanding of the problem through the analysis of the most relevant features. On the other hand, deep neural networks (DNNs) require large amounts of data to learn their parameters, whereas our proposed model works on small data. This research showed that our robust framework is capable of capturing the most potent and informative features from a massive feature set. Furthermore, the proposed framework is resistant to noise and is a data-independent machine learning method.